Philadelphia has experienced significant demographic and economic transformations over recent decades, leading to notable implications for its urban housing market. These shifts have resulted in variations in median house values, which serve not only as a reflection of the city’s economic health but also as a proxy for broader social and spatial dynamics. Increasing median house values may indicate an influx of higher-income residents or early stages of gentrification, whereas declining values can be symptomatic of disinvestment and economic decline. Given these dynamics, accurately forecasting median house values is vital for urban planners and policymakers who are tasked with promoting sustainable and equitable urban development.
In our previous study, Ordinary Least Squares (OLS) regression was used to explore the relationships between median house values, the dependent variable, and several key socio-economic predictors in Philadelphia. These predictors included educational attainment, vacancy rates, the proportion of detached single-family homes, and the poverty rate. Each of these factors influenced house prices in a different way.
Although OLS regression provides a foundational understanding of the relationships between the predictors and the dependent variable, it has limitations when applied to spatial data. One of the key assumptions of OLS regression is that observations are independent of each other, with no spatial autocorrelation. However, in spatial data, observations that are geographically close often exhibit similarity, leading to spatial autocorrelation and violating this assumption. Violating the assumptions of OLS may lead to biased and inefficient estimates of the regression coefficients and to incorrect inferences about the relationships between the predictors and the dependent variable.
To address these limitations, this report employs advanced spatial regression techniques to predict median house values in Philadelphia. We use Spatial Lag Regression, Spatial Error Regression, and Geographically Weighted Regression (GWR) to account for spatial autocorrelation and spatial heterogeneity in the data. We examine whether these spatial regression models predict house prices more accurately than the OLS regression model. By utilizing these spatial techniques, this study aims to improve upon the initial OLS findings and provide a more comprehensive understanding of the socio-economic and spatial factors influencing housing values. These insights will support more effective policy interventions and urban development strategies aimed at achieving equitable and sustainable growth in Philadelphia.
Spatial autocorrelation describes the degree to which a variable is correlated with itself across space. It shows the relationship of values within a single variable at nearby locations, helping in understanding patterns of spatial distribution and identifying clusters or dispersions in spatial data. The concept of spatial autocorrelation is rooted in The First Law of Geography, which states:
“Everything is related to everything else, but near things are more related than distant things.”
This principle suggests that geographically proximate areas tend to exhibit similar characteristics due to shared environmental, economic, or social factors.
Spatial autocorrelation measures how much a variable in one location is influenced by values in nearby locations. If observations that are close to each other in space have similar values, spatial autocorrelation is positive; if nearby observations have markedly different values, spatial autocorrelation is negative.
Moran’s I is a widely used method for measuring spatial autocorrelation. The formula for Moran’s I is:
\[ I = \frac{N}{\sum_{i} \sum_{j} w_{ij}} \times \frac{\sum_{i} \sum_{j} w_{ij} (X_i - \bar{X}) (X_j - \bar{X})}{\sum_{i} (X_i - \bar{X})^2} \]
where \(N\) is the number of observations, \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\), \(X_i\) and \(X_j\) are the values of the variable at locations \(i\) and \(j\), and \(\bar{X}\) is the mean of the variable.
A Moran’s I value close to +1 indicates strong positive spatial autocorrelation (clusters of similar values). A value near -1 suggests strong negative spatial autocorrelation (dispersion). A value near 0 implies no spatial autocorrelation.
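As a sketch of how the formula behaves, Moran's I can be computed directly in base R. The weight matrix and values below are invented for illustration and are not the report's data:

```r
# Moran's I computed directly from its formula:
# I = (N / sum_ij w_ij) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
morans_i <- function(x, W) {
  n   <- length(x)
  xc  <- x - mean(x)                # deviations from the mean
  num <- sum(W * outer(xc, xc))     # weighted cross-products of deviations
  den <- sum(xc^2)                  # total squared deviation
  (n / sum(W)) * (num / den)
}

# Four regions on a line: 1-2, 2-3, 3-4 are neighbors (binary weights)
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

x_clustered <- c(10, 11, 1, 2)   # similar values adjacent: positive I
x_alternate <- c(10, 1, 10, 1)   # checkerboard pattern: negative I
morans_i(x_clustered, W)
morans_i(x_alternate, W)
```

The clustered vector yields a positive statistic while the alternating vector yields a strongly negative one, matching the interpretation above.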
When dealing with spatial data, we use spatial weight matrices to define relationships between observations. Given \(n\) observations, we construct an \(n \times n\) matrix that summarizes all pairwise spatial relationships in the dataset. These matrices are essential for estimating spatial regression models and calculating spatial autocorrelation indices.
There are several ways to define spatial relationships within a weight matrix. Queen Contiguity Matrix assigns a weight of 1 if two regions share a border or a vertex, otherwise 0. Rook Contiguity Matrix assigns a weight of 1 if two regions share only a border, otherwise 0. Distance-based Matrix assigns weights based on the inverse distance between observations.
In this report, we use the Queen contiguity weight matrix, which considers all neighboring regions that share either a boundary or a vertex.
Although we only use the queen contiguity weight matrix in this report, analysts often use multiple spatial weight matrices to check the robustness of their results. Because different spatial weights capture spatial dependence at different levels of granularity, comparing them helps ensure that the results are not merely an artifact of the particular matrix chosen.
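The queen/rook distinction can be illustrated on a regular grid, where rook neighbors share an edge and queen neighbors share an edge or a corner. This base-R toy construction is for intuition only; for real polygon data (such as the block groups in this report) one would typically build the neighbor list with `spdep::poly2nb()`:

```r
# Contiguity weights on a regular grid: for grid cells, rook adjacency is
# Manhattan distance 1 and queen adjacency is Chebyshev distance 1.
grid_weights <- function(nrows, ncols, type = c("queen", "rook")) {
  type  <- match.arg(type)
  cells <- expand.grid(r = seq_len(nrows), c = seq_len(ncols))
  n <- nrow(cells)
  W <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    dr <- abs(cells$r[i] - cells$r[j])
    dc <- abs(cells$c[i] - cells$c[j])
    touch <- if (type == "queen") max(dr, dc) == 1 else dr + dc == 1
    if (i != j && touch) W[i, j] <- 1
  }
  W
}

Wq <- grid_weights(3, 3, "queen")
Wr <- grid_weights(3, 3, "rook")
rowSums(Wq)  # corner cells have 3 queen neighbors; the center cell has 8
rowSums(Wr)  # corner cells have 2 rook neighbors; the center cell has 4
```

Note how the queen criterion always yields at least as many neighbors as the rook criterion, since every shared edge also counts as contact.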
To determine whether spatial autocorrelation is statistically significant, we conduct a hypothesis test:
Null Hypothesis (\(H_0\)): No spatial autocorrelation, meaning that the spatial distribution of values follows a random pattern with no systematic clustering or dispersion. Each location’s value is independent of the values at neighboring locations.
Alternative Hypothesis 1 (\(H_{a1}\)): Positive spatial autocorrelation, meaning that similar values tend to cluster together. High values are surrounded by other high values, and low values are surrounded by other low values, forming distinct spatial patterns.
Alternative Hypothesis 2 (\(H_{a2}\)): Negative spatial autocorrelation, meaning that similar values tend to disperse rather than cluster. High values are surrounded by low values and vice versa, leading to a checkerboard-like spatial distribution.
To test significance, we use random shuffling. First, we randomly shuffle the variable values across spatial locations many times (999 permutations in this report). Then, we compute Moran’s I for each permuted dataset to generate a reference distribution. Finally, we compare the observed Moran’s I to this distribution to determine whether it is extreme, which tells us whether the observed clustering pattern is statistically meaningful rather than occurring by chance.
If the observed Moran’s I falls in the extreme tail of the simulated distribution, we reject the null hypothesis (H₀) in favor of the appropriate alternative hypothesis. A p-value less than 0.05 typically indicates significant spatial autocorrelation.
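A minimal base-R sketch of this permutation procedure follows; the report itself uses `moran.mc()` from spdep, and the data and weights here are invented for illustration:

```r
set.seed(42)

# Moran's I from its formula (as defined earlier in the text)
morans_i <- function(x, W) {
  xc <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(xc, xc)) / sum(xc^2)
}

# Eight regions on a line; adjacent regions are neighbors
W <- matrix(0, 8, 8)
for (i in 1:7) W[i, i + 1] <- W[i + 1, i] <- 1

x   <- c(10, 11, 12, 13, 1, 2, 3, 4)  # two clear clusters of similar values
obs <- morans_i(x, W)

# Reference distribution: reshuffle the values across locations 999 times
perm <- replicate(999, morans_i(sample(x), W))

# One-sided pseudo p-value: how extreme is the observed statistic?
p <- (sum(perm >= obs) + 1) / (999 + 1)
c(observed = obs, pseudo_p = p)
```

Because the clustered pattern produces a Moran's I far above anything the shuffled datasets typically achieve, the pseudo p-value is small and the null hypothesis of spatial randomness is rejected.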
While global Moran’s I provides a single statistic for the entire study area, Local Indicators of Spatial Association (LISA) provides insights into the presence of spatial autocorrelation at individual locations.
To determine whether local spatial autocorrelation is statistically significant, we conduct a hypothesis test:
Significance tests for local Moran’s I are conducted using random shuffling to ensure that detected clusters are not merely due to random chance. This process follows the same approach as global Moran’s I but involves randomly reshuffling the values of the variable across the study area while keeping the value at location \(i\) constant. By comparing the observed local Moran’s I to the distribution of values from these random permutations, statistical significance can be assessed.
If the observed \(I_i\) is extremely high or low compared to the reshuffled values, it is considered significant. The pseudo p-value is estimated from the rank of the actual \(I_i\) among the permutations. For instance, if the original \(I_i\) ranks as the 97th highest among 999 permutations, the estimated pseudo p-value is \(p \approx 0.097\).
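The rank-to-p-value conversion in the example above is simple arithmetic; the numbers below mirror that worked example:

```r
# Pseudo p-value from the rank of the observed local Moran's I
nperm    <- 999   # number of random permutations
rank_obs <- 97    # observed I_i is the 97th highest value
p_pseudo <- rank_obs / (nperm + 1)
p_pseudo  # 0.097
```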
To analyze the relationship between socioeconomic factors and median house values in Philadelphia, we often use OLS (Ordinary Least Squares) Regression. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives. The key assumptions of OLS regression include:
Linearity assumes that the relationship between the dependent variable and the predictors is linear. To verify this assumption, we made scatter plots of the dependent variable against each predictor. If the relationship appears to be linear, the assumption was met.
Independence of Observations assumes that the observations are independent of each other. There should be no spatial or temporal or other forms of dependence in the data.
Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant across all levels of the predictors. To check this assumption, we made a scatter plot of the standardized residuals against the predicted values. If the residuals are evenly spread around zero, the assumption was met. Any patterns may indicate the presence of heteroscedasticity.
Normality of Residuals assumes that the residuals are normally distributed. We examined the histogram of the standardized residuals to check if they are approximately normally distributed. If the histogram is bell-shaped, the assumption was met.
No Multicollinearity assumes that the predictors are not highly correlated with each other. We calculated the correlation matrix of the predictors to check for multicollinearity. If no correlation coefficient is greater than 0.8 or less than -0.8, the assumption was met.
No Fewer than 10 Observations per Predictor assumes that there are at least 10 observations for each predictor in the model. Since there are over 1,700 observations in the dataset, this assumption was met.
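The checks above can be sketched in base R. The simulated data and variable names below are placeholders for illustration, not the report's dataset:

```r
set.seed(1)
# Simulated stand-in data (illustrative only)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Linearity: dependent variable against each predictor
plot(x1, y); plot(x2, y)

# Homoscedasticity: standardized residuals vs. predicted values
plot(fitted(fit), rstandard(fit)); abline(h = 0, lty = 2)

# Normality of residuals: histogram of standardized residuals
hist(rstandard(fit), breaks = 20)

# Multicollinearity: pairwise correlations among predictors
cor_mat  <- cor(cbind(x1, x2))
max_corr <- max(abs(cor_mat[upper.tri(cor_mat)]))
max_corr < 0.8   # TRUE when the correlation rule of thumb is satisfied
```

Because the two simulated predictors are generated independently, their correlation is small and the multicollinearity check passes.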
A vital assumption of OLS regression is that observations are independent of each other. However, in spatial data, observations that are geographically close often exhibit similarity, leading to spatial autocorrelation and violating the independence assumption. When spatial autocorrelation is present, values of a variable in nearby areas are related rather than randomly distributed.
Furthermore, when data has a spatial component, the assumption of normality of residuals often—but not necessarily—fails to hold. In some cases, spatial autocorrelation does not significantly impact regression analysis. If the dependent variable exhibits strong spatial autocorrelation while the error term does not, the regression coefficients and significance levels remain valid. Additionally, if both the dependent and independent variables share an identical spatial pattern, and the spatial dependencies in the dependent variable are fully explained by those in the independent variable, the residuals may be spatially independent. However, this is not always the case, and it is essential to test for spatial autocorrelation in residuals to ensure the validity of the model.
To test this assumption, spatial autocorrelation of the residuals can be examined using Moran’s I, which measures whether residuals are clustered, dispersed, or randomly distributed in space. As described above, we first extract the residuals and define a spatial weights matrix (e.g., Queen or Rook contiguity). Then, Moran’s I is computed to measure the degree of clustering in the residuals, with values close to +1 indicating positive spatial autocorrelation, -1 indicating negative autocorrelation, and 0 suggesting randomness.
Another method to test for spatial autocorrelation in OLS residuals is to regress them on the residuals from nearby observations. In this report, nearby residuals refer to residuals from neighboring block groups, as defined by the Queen matrix. The regression line between the residuals, OLS_RESIDU and WT_RESIDU (weighted residuals from neighboring groups), helps identify any spatial autocorrelation. The slope (b) of this regression represents the strength of spatial dependence. It is calculated by estimating the relationship between the residuals of one observation and those of its neighbors.
As we use R for the analysis, it also provides methods for testing other regression assumptions.
One is Homoscedasticity, which assumes that the variance of the errors (residuals) remains constant across all levels of the independent variables. In R, tests such as the Breusch-Pagan Test, the Koenker-Bassett Test (also known as the Studentized Breusch-Pagan Test), or the White Test are used to detect heteroscedasticity.
If the p-value is less than 0.05, we reject the null hypothesis of homoscedasticity in favor of the alternative hypothesis of heteroscedasticity.
Another is Normality of Errors, assuming that residuals follow a normal distribution—a crucial requirement for valid hypothesis testing and confidence intervals. In R, tests such as the Jarque-Bera Test can be used.
The p-value determines whether the residuals follow a normal distribution. If the p-value is less than 0.05, we reject the null hypothesis of normality in favor of the alternative hypothesis of non-normality.
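As a sketch of what these tests compute, here are hand-rolled base-R versions run on simulated data with deliberately heteroscedastic, heavy-tailed errors. In practice one would call `lmtest::bptest()` and `tseries::jarque.bera.test()`; all data below are invented:

```r
set.seed(7)
n <- 500
x <- rnorm(n)
# Error standard deviation grows with x, so the errors are heteroscedastic
# and heavy-tailed relative to a normal distribution
y   <- 1 + 2 * x + rnorm(n, sd = exp(0.5 * x))
fit <- lm(y ~ x)
e   <- resid(fit)

# Breusch-Pagan idea: regress squared residuals on the predictors; LM = n * R^2
aux     <- lm(I(e^2) ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Jarque-Bera: combines skewness and excess kurtosis of the residuals
s <- mean(e^3) / mean(e^2)^1.5
k <- mean(e^4) / mean(e^2)^2
jb_stat <- n * (s^2 / 6 + (k - 3)^2 / 24)
jb_p    <- pchisq(jb_stat, df = 2, lower.tail = FALSE)

c(bp_p = bp_p, jb_p = jb_p)  # both far below 0.05: reject both null hypotheses
```

With this data-generating process, both tests reject: the squared residuals clearly depend on the predictor (heteroscedasticity), and the residual distribution has excess kurtosis (non-normality).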
##
## Monte-Carlo simulation of Moran I
##
## data: data$LNMEDHVAL
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.8, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
ggplot(data.frame(res = globalmoranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = globalmoranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Global Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

data1 <- data.frame(LNMEDHVAL = data$LNMEDHVAL,
                    spatial_lag = lag.listw(queenlist, data$LNMEDHVAL))
ggplot(data1, aes(x = LNMEDHVAL, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Global Moran's I Scatter Plot",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

# d. Local Moran's I (LISA analysis) for LNMEDHVAL
lmoran<-localmoran(data$LNMEDHVAL, queenlist)
head(lmoran)

##       Ii        E.Ii   Var.Ii Z.Ii Pr(z != E(Ii))
## 1 5.3520 -0.00304983 1.305146 4.69 0.000002767
## 2 4.4123 -0.00221660 0.759049 5.07 0.000000404
## 3 3.5007 -0.00304983 0.744493 4.06 0.000048927
## 4 2.4445 -0.00084388 0.241005 4.98 0.000000632
## 5 1.8835 -0.00109417 0.625911 2.38 0.017214326
## 6 0.0995 -0.00000161 0.000921 3.28 0.001042454
#Obtaining the Local Moran's P-Values (two-sided)
data$lmp <- lmoran[, "Pr(z != E(Ii))"]
data <- st_make_valid(data)
#Creating the LISA Clusters
mp <- moran.plot(as.vector(scale(data$LNMEDHVAL)), queenlist)

# Significance Map and Cluster Map
data$quadrant <- NA
# high-high
data[(mp$x >= 0 & mp$wx >= 0) & (data$lmp <= 0.05), "quadrant"]<- 1
# low-low
data[(mp$x <= 0 & mp$wx <= 0) & (data$lmp <= 0.05), "quadrant"]<- 2
# high-low
data[(mp$x >= 0 & mp$wx <= 0) & (data$lmp <= 0.05), "quadrant"]<- 3
# low-high
data[(mp$x <= 0 & mp$wx >= 0) & (data$lmp <= 0.05), "quadrant"]<- 4
# non-significant
data[(data$lmp > 0.05), "quadrant"] <- 5
# LISA P-Value Map
p_vals <- tm_shape(data) +
tm_polygons(col = "lmp", title = "",
breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
palette = c("darkblue", "blue", "lightblue", "white")) +
tm_layout(
legend.outside = TRUE,
legend.text.size = 1,
legend.title.size = 1,
fontfamily = "Arial",
title = "LISA P-Value Map",
title.size = 1.2,
frame = FALSE
)
# LISA Cluster Map
clusters <- tm_shape(data) +
tm_fill(col = "quadrant", title = "",
breaks = c(1, 2, 3, 4, 5, 6),
palette = c("red", "blue", "lightpink", "skyblue2", "white"),
labels = c("High-High", "Low-Low", "High-Low", "Low-High", "Non-significant")) +
tm_borders(alpha = 0.5) +
tm_layout(
frame = FALSE,
legend.outside = TRUE,
legend.text.size = 1,
legend.title.size = 1,
fontfamily = "Arial",
title = "LISA Cluster Map",
title.size = 1.2
)
# e. OLS Regression Analysis
reg<-lm(LNMEDHVAL ~ LNNBELPOV+PCTBACHMOR+PCTSINGLES+PCTVACANT, data=data)
summary(reg)

##
## Call:
## lm(formula = LNMEDHVAL ~ LNNBELPOV + PCTBACHMOR + PCTSINGLES +
## PCTVACANT, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2582 -0.2039 0.0382 0.2174 2.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.113778 0.046532 238.84 < 0.0000000000000002 ***
## LNNBELPOV -0.078903 0.008457 -9.33 < 0.0000000000000002 ***
## PCTBACHMOR 0.020910 0.000543 38.49 < 0.0000000000000002 ***
## PCTSINGLES 0.002977 0.000703 4.23 0.000024 ***
## PCTVACANT -0.019156 0.000978 -19.59 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.366 on 1715 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 841 on 4 and 1715 DF, p-value: <0.0000000000000002
# g. OLS residuals plotted
data$OLS_RESIDU<-rstandard(reg)
data$WT_RESIDU<-sapply(queen, function(x) mean(data$OLS_RESIDU[x]))
OLS.Residuals.Map<-tm_shape(data)+
tm_fill(col='OLS_RESIDU', style='quantile', title='Standardized OLS Residuals',
palette ='Blues')+
tm_layout(frame=FALSE, title = 'Standardised OLS Residuals')
# scatterplot of OLS_RESIDU by WT_RESIDU
ggplot(data, aes(x = WT_RESIDU, y = OLS_RESIDU)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Scatter Plot of OLS Residuals vs. Weighted Residuals",
x = "Weighted Residuals (WT_RESIDU)",
y = "OLS Residuals (OLS_RESIDU)") +
theme_minimal()
# Run simple regression of OLS_RESIDU on WT_RESIDU
lm_residuals <- lm(OLS_RESIDU ~ WT_RESIDU, data = data)
summary(lm_residuals)

##
## Call:
## lm(formula = OLS_RESIDU ~ WT_RESIDU, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.368 -0.445 0.058 0.462 5.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0128 0.0212 -0.6 0.55
## WT_RESIDU 0.7323 0.0324 22.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.879 on 1718 degrees of freedom
## Multiple R-squared: 0.229, Adjusted R-squared: 0.228
## F-statistic: 510 on 1 and 1718 DF, p-value: <0.0000000000000002
# h. Moran’s I of the OLS regression residuals
##
## Monte-Carlo simulation of Moran I
##
## data: data$OLS_RESIDU
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.3, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided